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(g) Method and apparatus for converting bitmap image documents to editable coded data using a 
standard notation to record document recognition ambiguities. 
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@ Documents represented as bitmap images 
(S100)are transformed into coded textual data 
(S120) and coded graphics data (S160) by 
graphics and textual recognizers, which use a 
standard notation for recording the results of 
the document recognition processes, including 
any ambiguities, in a document description lan- 
guage. Recognized portions of the document, 
represented as editable coded data, such as for 
example ASCII, are placed in elements, defined 
in the document description language, with all 
contents of an element sharing some common 
characteristic. Elements can include, for 
example : character-string-elements (S140), 
questionable-character-elements (S1 50), ques- 
tionable-word-elements, verified-word-ele- 
ments, altemative-word-elements, segment- 
elements, and arc-elements. Each element in- 
cludes editable coded data, which also includes 
uncertainty information (S155) identifying any 
coded data which was not transformed with a 
predetermined level of confidence. 
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The present invention relates to document recog- 
nition, and in particular to methods and apparatus for 
recognizing textual and graphics structures in docu- 
ments originally represented as bitmap images, and 
for recording the results of the recognition process. s 

Document recognition is the automatic transfor- 
mation of paper documents into editable electronic 
documents. It entails the gradual transformation of 
bitmaps into structured components, through succes- 
sive and recursive interventions of various processes. 10 
These processes include: page segmentation, char- 
acter recognition, graphics recognition, logical struc- 
ture reconstruction, spelling correction, semantic 
analysis, etc. All these processes are prone to misin- 
terpretation. Not all processes keep a record of the 15 
misinterpretations they are aware of, and the ones 
that do keep a record have no standard way of doing 
so. As a consequence, downstream processes are 
generally not prepared to handle the record of ambi- 
guities handed to them by upstream processes, and 20 
simply discard them. Valuable information is lost in- 
stead of being exploited for automatic improvement of 
the document recognition function. If, on the other 
hand, the ambiguity record is passed in its raw state 
to the user, the chore of making manual corrections 25 
can quickly outweigh the advantages of automatic 
recognition over a manual reconstruction of the entire 
document. 

US- A-4,91 4,709 and 4,974,260 to Rudak dis- 
close an apparatus and method for identifying and 30 
correcting characters which cannot be machine read. 
A bitmap video image of the unrecognized charac- 
ters) is inserted in an ASCII data line of neighboring 
characters, thereby allowing an operator to view the 
character(s) in question in context to aid in proper 35 
identification of the characters). Subsequently, with 
the aid of the video image, the operator enters the 
correct characters) via a keyboard or other means. 
This apparatus and method require operator interac- 
tion to clarify any ambiguities resulting from an auto- 40 
matic document recognition process. The results of 
these ambiguities are not recorded in a notation that 
can be used by other downstream automatic devices. 

US-A-4,907,285 to Nakano et al discloses an im- 
age recognition system which uses a grammar for de- 45 
scribing a document image, and parses statements 
expressed by the grammar to recognize the structure 
of an unknown input image. The grammar describes 
the image as substructures and the relative relation 
between them. In the parsing process, after the sub- 50 
structures and their relative relation are identified, a 
search is made as to whether the substructures and 
their relative relation exist in the unknown input im- 
age, and if they do, the inside of the substructures are 
further resolved to continue the analysis. If the sub- 55 
structures to not exist, other possibilities are search- 
ed and the structure of the unknown input image is 
thus represented from the result of the search. For 



example, the location of a rectangular region of the 
document which contains a statement defined by the 
document grammar (for example TITLE" and "AU- 
THOR") is initially represented by variables. See Fig- 
ure 10 of US-A-4,907,285. After locating this region 
in the document, the appropriate numeric values are 
substituted for the variables. 

US-A-4,949,188 to Sato discloses an image proc- 
essing apparatus for synthesizing a character or 
graphic pattern represented by a page description 
language and an original image. The image process- 
ing apparatus generates a page description language 
including code data which represents characters, 
graphics patterns, and the like, and command data 
which causes a printer to print the original image. Am- 
biguities from previous document recognition proc- 
esses are not recorded in the page description lan- 
guage. See, for example, the table in column 4, lines 
5-10. Accordingly, any downstream device receiving 
the page description language cannot determine 
whether any ambiguities occurred in the previously 
performed document recognition processes. 

US-A-4,654,875 to Srihari et al discloses a meth- 
od of automatic language recognition for optical char- 
acter readers. Language in the form of input strings 
or structures is analyzed on the basis of: channel 
characteristics in the form of probabilities that a letter 
in the input is a corruption of another letter the prob- 
abilities of the letter occurring serially with other rec- 
ognized letters that precede the letter being analyzed 
or particular strings of letters occurring serially; and 
lexical information in the form of acceptable words 
represented as a graph structure. Ambiguities from 
upstream recognition processes are not recorded. 

"Word Association Norms. Mutual Information, 
and Lexicography", by Kenneth W. Church and Pat- 
rick Hanks, Computational Linguistics, Vol. 16, No. 1 
(March 1990) discloses a measure, referred to as an 
"association ratio" based on the information theoretic 
notion of mutual information, for estimating word as- 
sociation norms from computer readable corpora. 
This association ratio can be used by a semantics 
analyzer to determine a most likely word from a 
choice of two or more words that have been identified 
as possible words. 

"On the Recognition of Printed Characters of Any 
Font and Size", by Simon Kahan, Theo Pavlidis and 
Henry S. Baird, IEEE Transactions on Pattern Analy- 
sis and Machine Intelligence, Vol. PAM1-9, No. 2 
(March 1987), discloses a system that recognizes 
printed text of various fonts and sizes for the Roman 
alphabet. Thinning and shape extraction are per- 
formed directly on a graph of the run-length encoding 
of the binary image. The resulting strokes and other 
shapes are mapped, using a shape-clustering ap- 
proach, into binary features which are then fed into a 
statistical Bayesian classifier. This system identifies 
multiple possible characters or words, and scores 
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them. However, the uncertainty in the recognition 
processes is not recorded using the standard notation 
of the present invention. 

In summary, a number of systems exist which can 
recognize graphics structures, text (characters, 
words, semantics, fonts) and logical structures (pa- 
ges, paragraphs, footnotes), and which can deter- 
mine the uncertainty with which the recognized fea- 
ture was recognized. However, none of these sys- 
tems record the results of the recognition process (in- 
cluding uncertainties) in a manner which can be used 
by other devices. This results in much information 
(particularly regarding uncertainty) being lost, espe- 
cially when different recognition systems (e.g., char- 
acter recognizers, word recognizers, semantics ana- 
lyzers) are used at different times (as opposed to be- 
ing integrated into one system). 

It is an object of the present invention to provide 
methods and apparatus for recording ambiguities in 
document recognition processes in a standard format 
which can be used by a variety of document recogniz- 
ers. 

It is another object of the present invention to pro- 
vide methods and apparatus for converting bitmap 
images into editable coded data, wherein information 
regarding ambiguities in the transformation process- 
es performed by upstream recognizers can be re- 
corded and thus used by downstream, higher level 
recognizers which attempt to resolve these ambigui- 
ties. 

To achieve the foregoing and other objects, and 
to overcome the shortcomings discussed above, 
methods and apparatus are provided for converting 
documents represented as bitmap image data into 
editable coded data, wherein a standard notation in a 
document description language is utilized for record- 
ing document recognition ambiguities by each docu- 
ment recognizer. When the results of document rec- 
ognition processes are recorded using this standard 
notation, any ambiguities are identified in a uniform 
manner so that downstream, higher level document 
recognition processes can attempt to resolve these 
ambiguities by using all information about the ambi- 
guities obtained by upstream document recognition 
processes. 

In particular, when using the standard notation of 
the present invention, each document recognizer can 
record the results of its recognition process in one or 
more elements, selectively identified using the docu- 
ment description language. Each element includes a 
type-identifier indicating a type of coded data (infor- 
mation) regarding the recognized (transformed) bit- 
map image contained therein. Each element also in- 
cludes editable coded data therein of the type identi- 
fied by the type-identifier, and can also include un- 
certainty information identifying any coded data 
which was not transformed with a predetermined lev- 
el of confidence. This uncertainty information is de- 



termined by the document recognizer, and is record- 
ed in a format that is readable by higher level, down- 
stream document recognizers. This uncertainty infor- 
mation can include the level of conf idence with which 

5 the uncertain coded data was recognized by the 
document recognizer, in order to further assist the 
higher level document recognizers in resolving ambi- 
guities. The uncertainty information can also include 
alternative coded data for each uncertain recognition. 

10 When the document recognizer is a character 
recognizer, any characters which are not recognized 
with a predetermined level of confidence are identi- 
fied and recorded by placing them in questionable- 
character- elements. The degree of certainty as well 

15 as alternative possible characters and their degree of 
certainty can also be recorded for each questionable 
character. Characters which were recognized with at 
least the predetermined level of confidence are 
placed in character-string-elements. 

20 When the document recognizer includes a word 
recognizer (such as, for example, a spelling checker), 
the word recognizer attempts to resolve any existing 
questionable characters by determining whether any 
words exist in a lexicon based upon each question- 

25 able character and the certain characters in the word 
containing each questionable character. If a word is 
identified in the lexicon for the word containing a 
questionable character, that word is identified as a 
verified word, and is recorded in a verif ied-word-ele- 

30 ment. If more than one verified words are found, they 
are placed in individual verif ied- word-elements which 
are collectively grouped together in an alternative- 
word-element. If no verified words are found for the 
word containing a questionable character, the ques- 

35 tionable-character-element remains. 

When the document recognizer includes a se- 
mantics analyzer, any identified alternative verified 
words are resolved by analyzing the words surround- 
ing the alternative verified words. If one of the alter- 

40 nate verified words can be confirmed with a predeter- 
mined level of confidence based on the semantics 
analysis, it is returned and merged with the surround- 
ing character-string-elements. If the semantics ana- 
lyzer cannot determine which of the alternative veri- 

45 f ied words is correct, it returns the alternatl Ve-word- 
element (and included verif ied-word-elements) as 
such, and can include data indicative of the probabil- 
ity that each verified word therein is the correct word. 
When the document recognizer includes a graph- 
so ics structure image recognizer, it outputs graphics- 
elements containing coded data representative of 
graphics structures recognized in the graphics im- 
age. These structures can include: lines defined be- 
tween endpoints; circles; arcs; etc. Additionally, line 

55 thickness information can also be returned and re- 
corded. Ambiguities in the recognition process such 
as x and y direction offsets and line thickness varia- 
tions can also be recorded. This data can be used by 
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downstream higher level graphics recognition proc- 
esses to resolve any ambiguities, or to recognize 
more complex graphics structures. For example, four 
lines recognized by a low level graphics recognizer 
could be determined to be a box by a higher level 
graphics recognizer if, for example, the endpoints can 
be determined with a high degree of certainty to be 
coincident 

Additional image recognition elements are pro- 
duced for recording information relating to larger por- 
tions (or subimages) of the document image. For ex- 
ample, data related to font text blocks, frames, pages, 
documents, and large and small pieces of unresolved 
bitmap images can also be recorded. 

The invention will be described in detail with ref- 
erence to the following drawings in which like refer- 
ence numerals refer to like elements, and wherein: 

Figure 1 is a sample page image used to illustrate 

the present invention; 

Figure 2 illustrates a character-string-element for 
collecting streams of characters recognized with 
or above a predetermined confidence level; 
Figure 3 illustrates a questionable-character- ele- 
ment for collecting questionable characters rec- 
ognized with a low confidence level; 
Figure 4 illustrates a questionable-word-element 
for collecting a questionable word which contains 
characters recognized with high confidence, but 
which was not found in a lexicon; 
Figure 5 illustrates verified-word-elements for 
collecting verified words found in a lexicon by re- 
solving a word containing one or more question- 
able characters, and an alternative-word-ele- 
ment for collecting alternative words when two or 
more verified words are found for one word con- 
taining questionable characters; 
Figure 6 illustrates a text-element for collecting 
text elements having the same font; 
Figure 7 illustrates a fontDef-element for collect- 
ing data relating to a font type; 
Figure 8 illustrates one type of graphics-element 
which is a segment-element for collecting data re- 
lating to a line segment; 

Figure 9 illustrates another type of graphics- ele- 
ment which is an arc-element for collecting data 
relating to an arc; 

Figure 10 illustrates another type of graphics- 
element which is an image-element for collecting 
data relating to a large unresolved bitmap image; 
Figure 11 illustrates another type of graphics- 
element which is a spot-element for collecting in- 
formation relating to a small unresolved bitmap 
image referred to as a spot, and for storing this 
information as a hexidecimal value; 
Figure 12 illustrates examples of elements refer- 
ring to other elements; 

Figure 13 illustrates a tBIock-element for collect- 
ing information relating to blocks of text; 



Figure 14 illustrates a frame-element for collect- 
ing information relating to frames which can in- 
clude text blocks, images, spots, arcs and seg- 
ments, as well as other frames; 
5 Figure 15 illustrates a page-element for collecting 

data relating to a page; 

Figure 16 illustrates a group-element for collect- 
ing information relating to a group of elements 
which extend across page boundaries; 

w Figure 17 illustrates a drStream-element for col- 
lecting data relating to an entire document; 
Figures 18A-C are a collection of all syntax nec- 
essary for describing a document; 
Figure 19 is a block diagram of a system for input- 

15 ting and converting a bitmap image into coded 

data streams using the present invention; 
Figure 20 is a flowchart illustrating a procedure 
performed by the system of Fig. 19 when using 
the present invention; and 

20 Figure 21 is a flowchart illustrating a procedure 

performed by the word recognizer of Fig. 19 
when using the present invention. 
The present invention utilizes a straight forward 
procedure for recording ambiguities through the suc- 

25 cessive stages of the document recognition process. 
These ambiguities are in the context of: 

characters processed by character recogniz- 
ers; 

words processed by character recognizers, 

30 spelling checkers, and semantics analyzers; 

text flow processed by logical structure recon- 
structers; and 

geometry of line segments and arcs processed 
by graphics recognizers. 

35 Each of these processes produce and/or con- 
sume a byte-oriented data stream (hereinafter refer- 
red to as a document recognition stream or 
DRstream), and bitmap streams (hereinafter referred 
to as image files), referenced by the DRstream. The 

40 DRstream carries information about one or several 
pages of a digitized document. The information de- 
scribes text with font, certain graphics primitives, and 
half tone images, as well as their relationships, and 
the ambiguities about them. 

45 The present invention does not provide any new 
document recognition processes (or document recog- 
nizers) in the sense that it can be used with existing 
recognizers which recognize, for example, characters 
or graphics structures, or determine words (by com- 

so paring sequences of characters against a lexicon of 
known words), or determine which word from a choice 
of possible words is correct. However, the present in- 
vention improves the efficiency and compatibility 
with which these different types of recognizers func- 

55 tion by providing a standard notation for recording the 
results obtained by the recognizers in a document de- 
scription language. 

Figures 2-18C illustrate this document recogni- 
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tion notation in ISO 8879 Standard Generalized Mark- 
up Language (SGML), according to the Document 
Type Definition discussed below. Each recognizer re- 
cords coded data, corresponding to the results of the 
recognition process which it performs, as coded infor- 
mation, referred to in SGML as elements. Each ele- 
ment contains coded data which has been recognized 
as being similar in some way (for example: text, 
graphics, same page, all certain characters, etc.). 
Each element includes: a) a type-identifier which in- 
dicates the type of coded data contained in that ele- 
ment; b) an optional identification number, unique 
amongst all similar type elements of a document, 
which distinguishes that element from other similar 
type elements so that an element can be referenced 
by other elements (most elements will have an iden- 
tification number); c) coded data obtained by the 
document recognition process (this could be strings 
of characters or parameters defining graphics struc- 
tures); and d) optional contents (referred to as attri- 
butes) for providing additional information (for exam- 
pie, uncertainty information) about the coded data in- 
cluded in that element. Although the attributes of an 
element can be used to record uncertainty informa- 
tion about coded data in an element (information such 
as, for example, levels of confidence with which the 
coded data was recognized, or possible offsets for 
parameters (e.g. endpoints defining a line segment) 
of a graphics structure), the type-identification in 
some cases also serves to convey uncertainty infor- 
mation by indicating thatthe contents of that element 
was determined with a level of confidence below a 
predetermined level of confidence. In the illustrated 
examples, the coded data is recorded as human read- 
able ASCII, however other codes could also be used. 

One familiar with SGMLwill understand the gen- 
eric contents of the elements to be described below. 
Thus, only a brief discussion of a generic element will 
be provided with reference to Figs. 18A-C. Then, 
each type of element will be specifically described 
with reference to Figs. 2-17. Figs. 18A-C illustrate a 
complete syntax of elements which can be used to de- 
scribe a document according to the present invention. 
This list of elements would be located at the start of 
each DRstream, and would be used by conventional 
parsers, programmed to parse streams written in 
SGML, to parse the DRstream contained therebelow. 
That is, after the syntax list of elements, a continuous 
stream of elements describing a specific document 
would be provided. As used herein, the terminolo- 
gy 'continuous stream of elements' refers to a group 
of elements which are identified as belonging togeth- 
er. Thus, in a markup language such as SGML, where 
white-space is permited (and, in fact, encouraged for 
readability), tabs, breakage into separate lines consti- 
tute white-space that the parser will ignore. In this 
sense, white-space is part of the continuous stream 
of elements. Other systems may have a limit on the 



size of character streams. In these systems, long 
DRstreams would be split across several files which 
would be identified as belonging together. Such a 
DRstream, where several files are identified as be- 

5 longing together, is also intended to be covered by 
the terminology "continuous stream of elements". 
(Some of the elements in Figs. 18A-C include attri- 
butes (to be described below) which also would be 
listed at the start of the DRstream.) Of course, all of 

w the elements listed in Figs. 1 8A-C are not required to 
record the results of a document recognition process; 
however, when more elements are provided, more in- 
formation can be recorded. Referring to Fig. 2, in 
SGML: the terminology "'ELEMENT s" means "define 

15 an element whose type is V"; the terminology"-©" 
means "the element begins whenever its type identi- 
fier appears bracketed < >, the element ends with </> 
(element-end marker), or when another element be- 
gins at the same or a higher level in the nesting struc- 

20 ture"; and "(#PCDATA)" means "the contents of this 
element is a character string". Thus, Fig. 2 defines an 
element containing a character string (such as 
"horse") which would be recorded as follows: 
<s>horse</s>; or 

25 <s>horse</>; or 

<s>horse 

Other possible contents of a element can be other 
elements (see, for example, the aw element of Fig. 5 
which includes two or more vw elements as its con- 

30 tents), or only attributes (represented by EMPTY and 
a attribute list - see Fig. 8). The terminology " + " in- 
dicates that the immediately preceding item can be 
repeated. These definitions will become more clear 
as each element is defined in more detail below. 

35 Figure 1 is a sample page image which will be 

used to illustrate the types of bitmap images which 
can be transformed and recorded, and their form of 
recordation, using the present invention. The sample 
image includes various interesting features, such as: 

40 characters hard to recognize because of their poor 
shape or poor quality; structured graphics in the form 
of two line segments; bitmap graphics in the form of 
some undefined drawing; logical structure in the form 
of footnote and its callout character. 

45 Figure 2 illustrates a character-string-element (s) 

into which a character recognizer collects characters 
that meet the following conditions: 

ail characters have been recognized with a 
, high confidence level (having at least a predeter- 

so mined level of confidence); 

all characters have the same font, baseline 
position and underlining status; and 

there is no significant white gap between each 
character (for instance, characters that are horizon- 

55 tally aligned but belong to two columns of text, sepa- 
rated by a certain amount of white space, are not put 
together in the same element). The illustrated type- 
identifier is "s". Character-string-elements do not 



9 



EP 0 549 329 A2 



10 



have id numbers, but instead can be placed in larger 
elements. 

With reference to the Figure 1 image, a portion of 
that image having a series of characters recognized 
with at least a predetermined level of confidence by 5 
a character recognizor would be recorded using the 
present invention implemented in SGML as follows: 

<s>Etymologies appear in square brack- 
ets[]following</> 

<s>the "definitions". In accordance with 10 

the</> 

Figure 3 shows a questionable-character-ele- 
ment (qc) where a character recognizer places char- 
acters t hat have a low certainty of being correctly rec- 
ognized. Existing character recognizers currently de- 15 
termine a level of confidence for each character. If a 
character is not recognized with at least a predeter- 
mined level of confidence, these character recogniz- 
ers somehow tag the character. However, bringing an 
uncertain character to the attention of the user is an- 20 
other matter. Some vendors have an interactive pack- 
age where recognizing and asking a user for guidance 
are intertwined; it is not known whether these sys- 
tems tag uncertain characters as such, because it is 
an internal matter, and the uncertainty is lifted right 25 
away by user intervention. Other vendors merely tag 
the uncertain characters, say with a pair of question 
marks, creating the problem that the next process 
down the line cannot distinguish these question 
marks from genuine ones. However, questionable 30 
characters are not recorded in a manner that can be 
used by other machines. (That is, question marks and 
highlighting may have some other meaning.) Thus, 
when this data is passed to a higher level device such 
as a spelling checker, the spelling checker will not be 35 
able to utilize the information that the character was 
not recognized with a high degree of certainty. 

In the present invention, a higher level device re- 
ceives the information that a character was not rec- 
ognized with a high degree of certainty since every 40 
character located in a questionable-character- ele- 
ment contains that characteristic. Thus, by using a no- 
tation in a document description language to record 
ambiguities, other recognizers can utilize uncertainty 
information. Preferably, each qc element carries one 45 
questionable character. The qc element could also 
contain a list of alternative characters if the character 
recognizer identifies more than one possible charac- 
ter below the predetermined confidence level for a 
particular portion of a bitmap image. Additionally, the so 
degree of certainty for the one or plurality of question- 
able characters can also be provided in each qc ele- 
ment. Ideally, questionable-character-elements are 
subsequently eliminated by a spelling checker. 

For example, the system described in the above 55 
incorporated paper by Simon Kahan et al could be 
used to generate alternative characters (or words), 
each having some type of measure indicative of the 



level of confidence associated with that character (or 
word). However, unlike the system disclosed by Ka- 
han et al, this information relating to characters 
and/or words would be recorded in appropriate, dis- 
tinct elements using a document description lan- 
guage according to the present invention. This would 
enable other, higher level document recognizing 
processes (which may be separate from and used at 
a time separate from the Kahan et al system) to ac- 
cess this information in a uniform way. The present 
invention also permits existing recognizers to operate 
in a more efficient manner. For example, by distin- 
guishing between certain and uncertain characters 
(or words), more complicated and time consuming 
recognition procedures can be limited to the uncer- 
tain characters (or words). 

Figure 4 illustrates a questionable-word-element 
(qw) into which a word recognizer (e.g. spelling 
checker) places words that contain letters recognized 
with a high level of confidence, but which are not 
found in the lexicon of the word recognizer. There is 
one questionable word per qw element These ques- 
tionable words can be resolved by other word recog- 
nizers which include different lexicons, or by some 
other means (such as a semantics analyzer), to be 
described below. 

With reference to Fig. 1, suppose all the charac- 
ters in the word "Jumblatt" were confidently recog- 
n ized, but t he spell checker of t he word recognizer d id 
not find the word "Jumblatt" in its lexicon. It would be 
recorded in a qw element as follows: 
<qw>Jumblatt</>. 

Figure 5 illustrates a verif ied-word-element (vw) 
and an aJternatlVe-word-element (aw) into which a 
word recognizer places words which are found in its 
attempt to eliminate questionable-character-ele- 
ments. The word recognizer looks for words in a lex- 
icon for each occurrence of a questionable character 
based upon the word associated with a questionable- 
character-element. If a word is found in its lexicon, the 
word recognizer places that word in a vw element. 
When the word recognizer tries to eliminate question- 
able characters, it may find several words, verified in 
its lexicon, rf the word recognizer cannot decide be- 
tween the verified words, it places each of them in a 
vw element and places the set of vw elements in an 
aw element for the benefit of a downstream process 
such as the semantics analyzer. 

The semantics analyzer would then attempt to 
determine which of the verified words is correct by 
analyzing the words surrounding each occurrence of 
alternative words. 

The word recognizer could use various conven- 
tional processes for selecting words to be compared 
with a lexicon. For example, every letter of the alpha- 
bet could be substituted for the questionable-charac- 
ter-element in the word containing that questionable- 
character-element, and these results searched in the 
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lexicon. If alternative questionable characters are 
provided in a questionable-character-element, the 
substitution could be limited to only the alternative 
questionable characters. If no verified words are found, 
the questionable-character-element would remain, and 
optionally, the uncertainty information contained there- 
in could be updated by the word recognizer. 

Consider, for example, the two strings and ques- 
tionable character from Fig. 1 illustrated below found 
by a character recognizer: 

<s>the origins of numerous English w</s> 

<qc>a</q> 

<s>rds are still obscure</s> 
the word recognizer, trying to reduce the question- 
able "a", find "wards" and "words" as candidates and 
replaces the above notation by: 

<s>The origins of numerous English</s> 

<aw><vw>wards<Aw><vw>words<^vw></aw> 

<s>are still obscure</s> 
The stream of elements could be supplied to a se- 
mantics analyzer which would attempt to determine 
which word was correct. If the semantics analyzer 
can determine which word is correct, it merges that 
word into the surrounding s-elements. For example, 
assume the following data is provided to the semant- 
ics analyzer. 

<s>, the origins of numerous English </s> 

<aw> 
<vw>wards</> 
<vw>words</> 

<s> are still obscure. </> 
and it decides from the context that "words", not 
"wards", is the correct choice. It can replace the 
above notation by any of the choices below (it does 
not really matter which choice is selected, however 
the first choice is the most logical and the second 
choice is the most expedient): 
<s>, the origins of numerous English words are still 
obscure.</> 

<s>, the origins of numerous English <s>words<s> 
are still obscure.</> 

<s>, the origins of numerous English words<s> are 
still obscure.</> 

<s>, the origins of numerous English <s> words are 
still obscure.</> 

It should be noted that the intermediate </>s have 
been omitted since they are optional. 

Figure 6 illustrates a text-element which is used 
to collect character data (s, aw, qc and qw elements) 
of the same font. A text element has an id attribute, al- 
lowing it to be referenced by higher elements, and an 
optional reference to a font identifier (defined below). 
If the font reference is not supplied, the most recently 
supplied one is used. The text-elements are produced 
by character recognizers that can discern different 
fonts. An example of data recorded in a text-element 
is as follows: 

<text id = 123 font = 2>list of s, aw, qc and qw 



elements </> 

Figure 7 illustrates a fontDef-element. Typefaces 
analyzed by the character recognition process are re- 
corded in fontDef elements with as much information 

5 as possible. The contents of a fontDef element is the 
font family name, if the character recognizer is able to 
derive it with confidence. In the likelihood that that 
font name cannot be derived, the contents is left emp- 
ty; it could be filled later, either by a downstream proc- 

10 ess or interactively by a user. 

The id-attribute enables text elements to refer- 
ence font descriptions. The size-attribute is measured 
in points. The base-attribute indicates whether the 
base line is offset by superscripting or subscripting. If 

15 there is underlining, the under-attribute indicates the 
position of the underline below the base line of the 
font An example of data recorded in a fontDef-ele- 
ment where the font family name is Frutiger is as fol- 
lows: 

20 <fontDef id = 2 size = 10 under = 1 >Frutiger</> 

Note that the attributes are recorded in the first set of 
brackets < >. 

Figure 8 illustrates a segment-element which is 
one type of graphics-element. Segment-elements are 

25 used by the graphics recognizer to note line seg- 
ments it recognizes from the bitmap image. The id-at- 
tribute enables higher elements to reference the seg- 
ment-element The coordinates of the ending points 
(x1, y1 andx2, y2), relative to the top left corner of the 

30 page, define the segment mathematically. The un- 
certainty about the exact ending point coordinates is 
recorded in the dx1, dy1 and dx2, dy2-attributes. 
Thus, dx1, dy1, dx2 and dy2 record possible offsets 
of the parameters (x1 , y1 , x2, y2) used to describe the 

35 line segment graphics structure. The segment thick- 
ness and its uncertainty are noted by the thick and 
dThick-attributes. An example of data recorded in a 
segment-element is provided below: 
<segment id = 14x1 =2100dx1 = 5y1 =1440x2 = 2100 

40 dx2 = 5 y2 = 2160thick = 1 7> </> 

As with the fontDef-element, the attributes are pro- 
vided within the first set of brackets. Since the seg- 
ment-element does not contain any character strings 
(its content is EMPTY), the first set of brackets is fol- 

45 lowed by an element-end marker </> or, since ele- 
ment-end markers are not required by a new element 
Figure 9 illustrates an arc-element, which is an- 
other type of graphics-element Arc-elements are 
used to note circles, circular arcs, ellipses and ellipt- 

50 ical arcs recognized from the bitmap image by the 
graphics recognizer. The id-attributes enable higher 
level elements to reference the arc. The other attri- 
butes are: 

x, y, dx, dy: coordinates and uncertainty at the 
55 center of the circle, ellipticle arc, measured from the 
top left corner of the page; 

r,dr length and uncertainty of the radius of arc 
of a'circle, or long axis of the arc of an ellipse; 
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rShort, drshort: length and its imprecision of 
the short radius of the arc of an ellipse; 

thetal, dThetal: angle between the vertical 
axis and the line passing through the center and one 
of the end points of the arc. This attribute is present s 
for arcs only. The angle can be measured in milliradi- 
ans; 

theta2, dTheta2: same as thetal , dThetal for 
the other end point; 

thetaO, dThetaO: angle between the vertical 10 
axis and the long axis of an ellipse. This attribute is 
present for ellipses and elliptical arcs only; 

thick, dThick: thickness and uncertainty of the 
arc, circle or ellipse. 

An example of data recorded in an arc-element is 15 
as follows: 

<arc id = 5462x = 2300 dx = 8y = 1 440 dy = 8r = 2 1 0Odr 
= 15></> 

Figure 10 illustrates an image-element which is a 
third type of graphics-element. The image element is 20 
used to denote a rectangular area of the page that 
has not been resolved as text or structured graphics, 
and is therefore left in bitmap form in a separate file. 
The image element contains the name of the file. 

The image element attributes encode the position 25 
and uncertainty relative to the top left corner of the 
page (x, dx, y, dy) and the dimensions (w, dw, h, dh) 
of the image. The resolattribute is expressed in bits 
per unit of measurement (the units of measurement is 
supplied by the dsStream element, defined later). 30 

At the onset of the document recognition opera- 
tion, the DRstream usually contains only image ele- 
ments, one per digitized page of the paper document. 
Gradually, as character strings, line segments and 
arcs are extracted (using conventional techniques), 35 
the bitmaps are replaced by smaller and perhaps 
more numerous ones. At the completion of the oper- 
ation, the only bitmaps left are the genuine halftone 
images and the portions of the document that the 
character recognizer and graphics recognizer could 40 
not decipher. 

A bitmap image stored in a file named "Squiggle" 
would be recorded as follows: 
<image id = 567x = 1840 y = 1680 w = 260 h = 
480>Squiggle</> 45 

Figure 11 illustrates a spot-element, which is a 
fourth type of graphics-element. Spotelements con- 
tain small images and denote a very small rectangular 
area left in bitmap format: unrecognized small 
smudges, dingbats, unknown symbols, etc. The bit- 50 
map is small enough that its bitmap can be encoded 
conveniently in hexadecimal form as the contents of 
the spot-element, rather than carried in a separate 
file. 

The x, dx, y and dy-attributes supply the position 55 
of the spot with respect to the top-left corner of the 
page. The bx-attribute gives the number of bits in the 
horizontal direction. It is constrained to be a multiple 



of eight. The by-attribute gives the number of 1-bit 
high rows. When a spot element needs to be imaged, 
the hexadecimal value is consumed 8*bx bits (2*bx 
hexadecimal characters) at a time for each row. The 
hexadecimal value contains trailing 0 bits where ap- 
propriate. 

Suppose that the bullet • in the Fig. 1 sample 
page has not been recognized. It would be noted as 
a small image as follows: 

<spot id = 11 x = 590 y = 850 bx = 25 by = 25 
> O3FFB00O... </> 

Figure 12 illustrates references to other ele- 
ments. The text, segment, arc, image and spot- 
elements may be grouped together by higher-level 
elements (text blocks, frames and pages, discussed 
below), via a reference to their identifier. A reference 
to a single element is made by an item-element, the 
single attribute of which has the value of the identifier 
of the referenced element. 

A reference to a consecutive succession of ele- 
ments is made by a range element: "from 0 and "to" at- 
tributes refer to the identifiers of the first and last ref- 
erenced elements. "First" and "last" are relative to the 
chronological order in which the elements are found 
in the DRstream. A range-element is a short-hand no- 
tation for an unbroken succession of item-elements. 

Ambiguities about grouping are denoted by al- 
tern-eiements. Alternative groupings are used by 
processes to encode a number of reasonable element 
groupings. For instance, a page of text has been rec- 
ognized as having four text blocks, two on the left side 
and two on the right; the logical structure processor 
(or logical reconstructed, unable to determine if the 
text reads as two columns or as two rows, groups 
them in the order top left, bottom left, top right, bottom 
right; or the order top left top right, bottom left, bottom 
right; a downstream process like a syntactic analyzer 
might be able to resolve the ambiguity. 

Figure 13 illustrates a tBIock element. tBIock- ele- 
ments encode rectangular areas forming an invisible 
boundary around a text line or a set of equally spaced 
text lines. The location of a tBIock, relative to the top 
left corner of the page, and their imprecision are giv- 
en by the x, y, dx and dy-attributes. The dimensions 
and uncertainties are recorded by w, h, dw, and dh. 
The interl-attribute measures the interval between 
the equally-spaced lines within the block; its value is 
zero when the text block contains one line only. The 
x1 , dx1 , y1 , dy1 -attributes give the location of the first 
character in the text block, relative to the top left cor- 
ner of the block. 

Figure 14 illustrates a frame-element. A frame- 
element encodes a rectangular area, smaller than or 
equal to the page area. It is used to aggregate text 
blocks, images, spots, arcs and segment, as well as 
other frames. Frames may overlap on the page. 

Figure 15 illustrates a page-element. A page- 
element aggregates all the pieces of information 
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about a digitized page of a document. If there is no 
page-element In a DRstream, it is assumed that all 
the DRstream data belongs to a single page. 

Figure 16 illustrates a group-element. A group 
element enables a collection of the elements across 
page boundaries. It may be used by the logical struc- 
ture reconstructer and the semantic analyzer to indi- 
cate the flow of text across pages. 

Figure 17 illustrates a drStream element. At the 
top of the document type definition, is the drStream 
element Its unit-attribute gives the name of the 
measurement unit used throughout the drStream. 
The fraction-attribute indicates what fraction of the 
measurement unit the coordinates, dimensions, and 
their imprecisions actually represent For example, if 
the measures are in microns, the drStream element 
attributes are: 

<drStream unit = meter fraction = 1000000> 

Figures 18A-C illustrates all of the elements used 
in the disclosed page description language. 

Figure 19 illustrates a document recognition sys- 
tem useable with the present invention. Figs. 20 and 
21 are flowcharts illustrating procedures foroperating 
the Fig. 19 system according to the present invention. 
In order to input a bitmap image(SIOO), a paper docu- 
ment is scanned using an image scanner 100 to pro- 
duce a bitmap document image 110. It is understood 
that the scanning process can be performed essen- 
tially at the same time that the recognition processes 
are performed, or the bitmap document image 110 
can be supplied on some type of electronic storage 
medium such as a hard or floppy disk. The bitmap 
document image 110 is supplied to a conventional 
segmenter 150 (S110) which segments the bitmap 
image into smaller subimages, such as, for example 
textual subimages containing only text, and graphics 
subimages containing only graphics. The segmenter 
150 can iteratively segment the bitmap image into 
smaller subimages until each subimage is recognized 
as containing only text or only graphics. The graphics 
subimages are then supplied to a structure image rec- 
ognizer (or graphics recognizer) 200, while the textual 
subimages are supplied to a character recognizer 
300. Of course, if it is known in advance that the bit- 
map document image contains only text or graphics, 
it can be supplied directly to the structure image rec- 
ognizer 200 or character recognizer 300. 

The structure image recognizer 200 then trans- 
forms the bitmap graphics image (or subimages) into 
coded graphics data (S160) which can be recorded in 
graphics-elements, such as the graphics-elements 
described above, using a document description lan- 
guage. That is, unrecognized bitmap images are 
placed in unresolved-graphics-image-type elements 
(S170): image-elements if they are large, or spot- 
elements and represented as hexadecimal values if 
they are small. If the graphics image or subimages 
are recognized and transformed into coded graphics 



data, they are placed in one or more segment-ele- 
ments and/or arc-elements (S180). Additional infor- 
mation regarding uncertainty (for example, possible 
offsets of the parameters (coded data) describing the 

5 graphics structures optionally can be recorded in 
these elements (S185). 

Thus, when transforming graphics bitmap im- 
ages into editable coded data using the present in- 
vention, the graphics structure image recognizer 200 

10 acts as a first transformation means for performing a 
first transformation operation on the bitmap graphics 
image to transform the graphics bitmap image into 
one or more graphics elements containing coded data 
defining graphics structures; and as a first identif ica- 

15 tion means using the document description language 
for identifying the one or more graphics elements 
transformed by the first transformation means, each 
graphics-element including an element type identifier 
indicating a type of coded data regarding the recog- 

20 nized bitmap image contained in that element When 
the first transformation means determines that the 
coded data contained in the graphics-element has not 
been transformed with a predetermined level of con- 
fidence, the identification means also includes in the 

25 graphics-element uncertainty information (offsets) 
determined by the first transformation means regard- 
ing the coded data contained in each graphics ele- 
ment 

The character recognizer 300 transforms the bit- 

30 map textual image (or subimages) into coded charac- 
ter data (S120) which is then stored in the appropriate 
element in S140 or S150 (character-string or ques- 
tionable-character) as described above. In order to 
determine whether to place coded character data in 

35 a character-string-element or a questionable-charac- 
ter-element, a determination is made in S130 as to 
whether a recognized character was recognized with 
at least a predetermined level of confidence. Al- 
though the insertion of a character into a question- 

40 able-character-element serves to convey uncertainty 
information about that character, additional informa- 
tion such as alternate possible uncertain characters, 
or degrees of certainty for uncertain characters can 
also be included in a questionable-character-element 

45 (S155). Thus, the character recognizer 300 will pro- 
duce a stream of character-string-elements and ques- 
tionable-character-elements, which can then be sup- 
plied to a word recognizer 400. 

The word recognizer 400 includes a dictionary or 

so lexicon of words therein. The word recognizer 400 op- 
erating according to the present invention, would then 
perform the procedure illustrated in Figure 21 for 
each questionable-character-element. First, in S200, 
a plurality of characters are sequentially substituted 

55 for the questionable-character-element in the word 
containing the questionable-character-element. In 
S210, a determination is made as to whether any of 
the words formed by the substituting step (S200) 
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were found in the lexicon of the word recognizer 400. 
Such words are referred to as "verified words". If no 
verified words are found, the questionable-character- 
element is returned in S240, and optionally, in S245, 
the uncertainty information contained in the ques- 
tionable-character-element is updated based upon 
any determinations made by the word recognizer 
400. If the determination in S210 is positive, each 
verified word is placed in a verified-word-element 
(S220). Next, in S230, if more than one verif ied-word- 
elements are produced from a single questionable- 
character-element, the multiple verified-word-ele- 
ments are placed in an alter native- word-element. 

Each alternative-word-element can be trans- 
formed into a character-string-element by a semant- 
ics analyzer 500 which attempts to determine which 
of the verified words in an alternative-word-element 
is correct based upon surrounding words. If the se- 
mantics analyzer cannot determine which of the veri- 
fied words in an alternative-word-element are cor- 
rect it returns the aiter native- word-element, and op- 
tionally provides uncertainty information for each of 
the verified words in each verified- word-element 
therein. 

Thus, when transforming textual bitmap images 
into editable coded data using the present invention, 
the character recognizer 300 acts as a first transfor- 
mation means for performing a first transformation 
operation on the textual bitmap image to transform 
the textual bitmap image into one or more elements 
containing coded character data; and as a first iden- 
tification means using the document description lan- 
guage for identifying the one or more elements trans- 
formed by the first transformation means, each ele- 
ment including an element type identifier indicating a 
type of coded character data regarding the recog- 
nized bitmap textual image contained in the element. 
Elements containing characters not recognized with a 
predetermined level of confidence are recorded in 
elements identified by the first identification means 
as questionable-character-elements, while certain 
characters are recorded in elements identified as 
character-string-elements. 

The word recognizer 400 acts as a second trans- 
formation means for transforming each questionable- 
character-element and adjacent confidently recog- 
nized characters in a same word as the questionable- 
character-element into one or more verified words by 
substituting alternative characters for the question- 
able-character-element and verifying that a word re- 
sulting from the substitution exists in a lexicon; and as 
a second identification means using the document 
description language for placing each verified word in 
a verified-word-element. When more than one veri- 
fied-word- elements are created for a questionable- 
character-element, the second identification means 
also places the more than one verified-word-ele- 
ments in an alternative-word-element The second 



identification means maintains the questionable- 
character-element when no verified words are deter- 
mined to exist. 

The alternative-word-element can then be sup- 

5 plied to semantics anlyzer 500 wh ich acts as a means 
for determining which verified word within an alterna- 
tive-word-element is a correct verif led word based on 
words surrounding the alternative-word-element; 
and as a third identification means for identifying the 

w correct verified word, and for replacing the alterna- 
tive-word-element with a character-string-element 
containing the correct verified word. 



15 Claims 

1 . A method of transforming a document represent- 
ed as a bitmap image into an editable coded data 
stream using a document description language to 

20 record information regarding uncertainties in the 

document transformation process, comprising: 

performing a first transformation opera- 
tion on at least a portion of said bitmap image us- 
ing a first recognition apparatus, to transform at 

25 least said portion of said bitmap image into one 

or more elements, selectively identified using 
said document description language, each ele- 
ment including an element type identif ier indicat- 
ing a type of coded information regarding the rec- 

30 ognized bitmap image contained in said element, 
and, when said first recognizer determines that 
the recognized bitmap image contained in an ele- 
ment has not been recognized with at least a pre- 
determined level of confidence, each element 

35 also including uncertainty information deter- 
mined by said first recognizer regarding said rec- 
ognized bitmap image contained in said element. 

2. The method of claim 1 , wherein said uncertainty 
40 information includes a degree of uncertainty with 

which said first recognizer transformed said bit- 
map image. 

3. The method of claim 2, wherein said portion of 
45 said subimage contains a graphics bitmap image, 

said first recognizer is a graphics recognizer, said 
type identifier identifies as graphics-elements 
portions of said graphics bitmap image trans- 
formed by said graphics recognizer, and said un- 
50 certainty information includes possible offsets 

for parameters used to describe said graphics 
structures. 

4. The method of claim 2, wherein said uncertainty 
55 information includes alternative transformations 

of the recognized bitmap image contained in 
each element, and a degree of certainty for each 
alternative transformation, said alternative trans- 

10 
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formations and degrees of certainty being deter- 
mined by said first recognizer. 

5. The method of claim 1 , wherein said uncertainty 
information, includes alternative transformations 
of the recognized bitmap image contained in 
each element, said alternative transformations 
being determined by said first recognizer. 

6. The method of claim 1, wherein said portion of 
said sub image contains a text bitmap image, said 
first recognizer is a character recognizer, and 
said type identifier identifies portions of said text 
bitmap image transformed by said character rec- 
ognizer as character-string-elements or ques- 
tionable-character-elements, each character- 
string-element containing a string of consecutive 
characters recognized by said character recog- 
nizer with at least said predetermined level of 
confidence, each questionable-character-ele- 
ment containing said uncertainty information de- 
termined by said character recognizer for a char- 
acter which was not recognized with at least said 
predetermined level of confidence by said char- 
acter recognizer. 



tionable-character-element, said uncertainty in- 
formation pertaining to a character not recog- 
nized with at least said predetermined level of 
confidence includes a most likely uncertain char- 
5 acter, and/or alternative possible uncertain char- 

acters, identified by said character recognizer. 

10. The method of claim 11, wherein for each ques- 
tionabie-character-element, said uncertainty in- 
fo formation pertaining to a character not recog- 
nized with at least said predetermined level of 
confidence also includes a degree of confidence 
determined by said character recognizer for said 
most likely and/or alternative possible uncertain 

15 ' characters. 

11. The method of claim 1, including: 

segmenting said bitmap image into one or 
more graphics bitmap subimages and one or 
20 more text bitmap subimages, each of said one or 

more text bitmap subimages being analyzed as 
text portions; and each of said one or more graph- 
ics subimages being analyzed using a graphics 
structure analyzer to transform said one or more 
25 graphics subimages into one or more graphics- 

elements. 

1 2. An automatic document recognition apparatus for 
transforming documents represented as bitmap 
image data into an editable coded data stream us- 
ing a document description language to record in- 
formation regarding uncertainties in the docu- 
ment transformation process, said apparatus 
comprising: 

a first recognizer having: 

a) first transformation means for performing a 
first transformation operation on said bitmap 
image representation of said document to 
transform said document into one or more ele- 
ments containing coded data; and 

b) first identification means using said docu- 
ment.description language for identifying said 
one or more elements transformed by said 
first transformation means, each element in- 
cluding an element type identifier indicating a 
type of coded data regarding the recognized 
bitmap image contained in said element, and, 
when said first transformation means deter- 
mines that the coded data contained in the 
element has not been transformed with a pre- 
determined level of confidence, said identifi- 
cation means also including in said element 
uncertainty information determined by said 
first transformation means regarding said 
coded data contained in said element 



7. The method of claim 6, further comprising: 

for each questionable-character-element, 
using a word recognizer to transform said ques- 
tionable-character-element and adjacent confi- 30 
dently recognized characters in a same word as 
said questionable-character-element into one or 
more verified- word-elements by substituting al- 
ternate characters for said questionable-charac- 
ter-element when one or more words created by 35 
said substituting are recognized by said word rec- 
ognizer; when more than one verified-word-ele- 
ment is transformed for each questionable-char- 
acter-element, said more than one verif ied-word- 
elements being placed in an alternate-word-el e- 40 
ment; said questionable-character-element re- 
maining when no verified words are recognized 
by said word recognizer. 

8. The method of claim 7, further comprising: 45 

for each alternative-word-element, using a 
semantics analyzer to transform verified words of 
the verified-word-elements contained in each al- 
ternate word element into a character-string-ele- 
ment corresponding to one of the verified words so 
contained in said alternative-word-element when 
said semantics analyzer determines that said 
one of said verified words is a correct word, said 
alternative-word-element remaining when none 
of said verified words is determined to be a cor- 55 
rect word by said semantics analyzer. 

9. The method of claim 6, wherein for each ques- 
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13. The apparatus of claim 12, wherein said uncer- 
tainty information includes a confidence level 



BNSDOCID- <EP 0549329 A2_l_> 



21 



EP 0 549 329 A2 



with which said first transformation means deter- 
mined said coded data. 

14. The apparatus of claim 12, wherein said uncer- 
tainty information includes coded data alterna- 5 
tives determined by said first transformation 
means for a portion of said bitmap image. 
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§ Etymologies 



Etymologies appear in square brackets [ ] following 
the "definitions". In accordance with the Dictionary's 
policy of eliminating special symbols k dingbats... 

• Obscure Origin. According to the noted 
linguist Jumblatt, the origins of numerous English 
words are still obscure ... 




AHD 



t The American Heritage Dictionary 
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OELEMENT s - 0 (fPCDATA) — chor string— > 

FIG.2 



<!ELEMENT qc - 0 (fPCDATA) —questionable char— > 

FIG.3 



<!ELEMENT qw - 0 (fPCDATA) — questionable word— > 

FIG.4 



C'ELEMENT vw 


- 0 (fPCDATA) 


— verified word — > 


OELEMENT aw 


- 0 (vw, vw+) 


— alternate words — > 




FIG.5 
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<! ELEMENT text 




.0 1 s | aw | qc | qw )* 


> 


<! ATTL 1ST text 








id 


ID 


# REQUIRED 




font 


IDREF 


# CURRENT 


> 



FIG.6 



<! ELEMENT fontDef -0 IttPCDATA] 

<! ATTL 1ST fontDef 

id ID ^REQUIRED 

size NUMBER 12 

weight lultraL | extraL | light| 
semiL | med i um | semi B 
bold | extraB | ultra 

medium 

posture (roman | italic! roman 
base (normal | sub | sup) normal 

under NUMBER 0 --underline position if any-> 



<! ELEMENT segment - 0 EMPTY > 

<! ATTL I ST segment 



id 


ID 


# REQUIRED 


xi 


NUMBER 


0 


dxl 


NUMBER 


0 


yi 


NUMBER 


0 


dyl 


NUMBER 


0 


x2 


NUMBER 


0 


dx2 


NUMBER 


0 


y2 


NUMBER 


0 


dy2 


NUMBER 


0 


thick 


NUMBER 


0 


dThick 


NUMBER 


0 



-font family name— > 



-font size in points— 
—font weight— 



-printing type style— 



FIG.8 
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<! ELEMENT arc 




-0 EMPTY 


> 


<!ATTL 1ST ore 








id 


ID 


REQUIRED 




X 


NUMBER 


0 




dx 


NUMBER 


0 




y 


NUMBER 


0 




dy 


NUMBER 


0 




r 


NUMBER 


0 




dr 


NUMBER 


0 




rShort 


NUMBER 


0 




drShort 


NUMBER 


0 




thick 


NUMBER 


0 




dThick 


NUMBER 


0 




thetaO 


NUMBER 


0 




dThetaO 


NUMBER 
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